Name : Jagadeesh Nadimpalli
IdentiKey: jana3207
Overview: The Walmart Store Sales Dataset provides sales data from 45 Walmart stores located in various regions of the United States, covering the period from February 5, 2010, to November 1, 2012. The data is used to forecast sales and demand, particularly during crucial holiday weeks when sales increases are expected.
Features:
- Store: A unique identifier assigned to each of the 45 Walmart stores, numbered sequentially from 1 to 45.
- Date: The date corresponding to the week of sales. The data records weekly sales figures, offering valuable insight into weekly patterns.
- Weekly_Sales: The total sales amount for a given store in a specific week. This is the response variable for predictive modeling.
- Holiday_Flag: A binary indicator where '1' indicates a holiday week and '0' indicates a non-holiday week. Holiday weeks include special events such as the Super Bowl, Labour Day, Thanksgiving, and Christmas.
- Temperature: The local temperature on the day of the sale, which could impact sales trends.
- Fuel_Price: The price of fuel in the specific area on the day of purchase. Fluctuations in fuel prices can significantly affect both consumer purchasing power and the costs of transportation and distribution.
- CPI: The Consumer Price Index for the region at the time of sale, which measures inflation and economic conditions that could affect sales.
- Unemployment: The unemployment rate in the region at the time of sale, which can affect consumer spending habits.

This dataset is particularly useful for building machine learning models to predict future sales, examining the influence of economic indicators on sales, and planning inventory and staffing for anticipated surges during holiday seasons. Markdown events and promotions, which precede major holidays, are critical in the analysis as they significantly influence weekly sales figures.
Dataset Link : https://www.kaggle.com/datasets/yasserh/walmart-dataset/data
What are the most important factors influencing weekly sales at Walmart, and how do seasonal variations, specific holidays, and economic indicators such as CPI and fuel prices affect consumer spending patterns and sales performance across store locations?
#installing necessary packages
install.packages("lubridate")
install.packages('gridExtra')
install.packages('corrplot')
install.packages("leaps")
install.packages('MASS')
install.packages('car')
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
also installing the dependencies ‘numDeriv’, ‘SparseM’, ‘MatrixModels’, ‘minqa’, ‘nloptr’, ‘Rcpp’, ‘RcppEigen’, ‘carData’, ‘abind’, ‘pbkrtest’, ‘quantreg’, ‘lme4’
#importing necessary libraries
library(tidyverse)
library(ggplot2)
library(lubridate)
library(gridExtra)
library(corrplot)
library(leaps)
library(MASS)
library(car)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::%||%() masks base::%||%()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’: combine

corrplot 0.92 loaded

Attaching package: ‘MASS’
The following object is masked from ‘package:dplyr’: select

Loading required package: carData

Attaching package: ‘car’
The following object is masked from ‘package:dplyr’: recode
The following object is masked from ‘package:purrr’: some
#reading the csv file
walmart_data = read.csv("/content/Walmart.csv")
head(walmart_data)
| | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
|---|---|---|---|---|---|---|---|---|
| | <int> | <chr> | <dbl> | <int> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 1 | 5/2/10 | 1643691 | 0 | 42.31 | 2.572 | 211.0964 | 8.106 |
| 2 | 1 | 12/2/10 | 1641957 | 1 | 38.51 | 2.548 | 211.2422 | 8.106 |
| 3 | 1 | 19-02-2010 | 1611968 | 0 | 39.93 | 2.514 | 211.2891 | 8.106 |
| 4 | 1 | 26-02-2010 | 1409728 | 0 | 46.63 | 2.561 | 211.3196 | 8.106 |
| 5 | 1 | 5/3/10 | 1554807 | 0 | 46.50 | 2.625 | 211.3501 | 8.106 |
| 6 | 1 | 12/3/10 | 1439542 | 0 | 57.79 | 2.667 | 211.3806 | 8.106 |
The data cleaning process is crucial for every dataset. Here it involves the following steps: checking the dimensions and data types, lowercasing the column names, checking for null values, converting the date column to a proper Date type, converting the categorical columns to factors, and checking for duplicated rows.
We follow this sequence to prepare our data for analysis.
#checking the dimensions of the dataset (shape)
dim(walmart_data)
#checking for data types of each column in the dataframe
str(walmart_data)
'data.frame':	6435 obs. of 8 variables:
 $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Date        : chr  "5/2/10" "12/2/10" "19-02-2010" "26-02-2010" ...
 $ Weekly_Sales: num  1643691 1641957 1611968 1409728 1554807 ...
 $ Holiday_Flag: int  0 1 0 0 0 0 0 0 0 0 ...
 $ Temperature : num  42.3 38.5 39.9 46.6 46.5 ...
 $ Fuel_Price  : num  2.57 2.55 2.51 2.56 2.62 ...
 $ CPI         : num  211 211 211 211 211 ...
 $ Unemployment: num  8.11 8.11 8.11 8.11 8.11 ...
#lowercasing the column names of the dataframe
names(walmart_data) = tolower(names(walmart_data))
#checking for null values
print(colSums(is.na(walmart_data)))
store date weekly_sales holiday_flag temperature fuel_price
0 0 0 0 0 0
cpi unemployment
0 0
# convert the date column from character to Date type using the lubridate library
walmart_data$date = gsub("-", "/", walmart_data$date)
walmart_data$date = parse_date_time(walmart_data$date, orders = c("d/m/y", "d/m/Y"))
walmart_data$date = as.Date(walmart_data$date)
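The date column mixes "d/m/y" and "d-m-Y" strings, which is why two `orders` are supplied above. As a hedged sanity check, the same idea can be sketched in base R (the helper name below is made up for illustration):

```r
# Hypothetical helper: try each format in turn and keep the first successful parse.
parse_mixed_dates = function(x, fmts = c("%d/%m/%y", "%d-%m-%Y")) {
  out = rep(as.Date(NA_character_), length(x))
  for (fmt in fmts) {
    parsed = as.Date(x, format = fmt)       # non-matching strings become NA
    out[is.na(out)] = parsed[is.na(out)]    # fill only the still-unparsed slots
  }
  out
}
parse_mixed_dates(c("5/2/10", "19-02-2010"))  # both resolve to February 2010
```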
#converting the store and holiday_flag variables to factor variables
walmart_data$store = factor(walmart_data$store)
walmart_data$holiday_flag = factor(walmart_data$holiday_flag, levels = c(0, 1), labels = c("non_holiday_week", "holiday_week"))
#reorder the weekly_sales column to last
walmart_data = walmart_data[, c('store', 'date', 'holiday_flag', 'temperature', 'fuel_price', 'cpi', 'unemployment', 'weekly_sales')]
#checking for duplicated rows
sum(duplicated(walmart_data))
#extracting the response variable and predictors into different variables
response_variable = names(walmart_data)[8]
features = names(walmart_data)[-8]
The dataset contains no null values or duplicated rows, as indicated by the results above.
We converted the store and holiday_flag columns to factor variables because they hold categorical values. Following the dataset description, we labeled holiday_flag as non_holiday_week (0) and holiday_week (1).
#checking the data after cleaning and also datatypes
head(walmart_data)
| | store | date | holiday_flag | temperature | fuel_price | cpi | unemployment | weekly_sales |
|---|---|---|---|---|---|---|---|---|
| | <fct> | <date> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 1 | 2010-02-05 | non_holiday_week | 42.31 | 2.572 | 211.0964 | 8.106 | 1643691 |
| 2 | 1 | 2010-02-12 | holiday_week | 38.51 | 2.548 | 211.2422 | 8.106 | 1641957 |
| 3 | 1 | 2010-02-19 | non_holiday_week | 39.93 | 2.514 | 211.2891 | 8.106 | 1611968 |
| 4 | 1 | 2010-02-26 | non_holiday_week | 46.63 | 2.561 | 211.3196 | 8.106 | 1409728 |
| 5 | 1 | 2010-03-05 | non_holiday_week | 46.50 | 2.625 | 211.3501 | 8.106 | 1554807 |
| 6 | 1 | 2010-03-12 | non_holiday_week | 57.79 | 2.667 | 211.3806 | 8.106 | 1439542 |
str(walmart_data)
'data.frame': 6435 obs. of 8 variables: $ store : Factor w/ 45 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ... $ date : Date, format: "2010-02-05" "2010-02-12" ... $ holiday_flag: Factor w/ 2 levels "non_holiday_week",..: 1 2 1 1 1 1 1 1 1 1 ... $ temperature : num 42.3 38.5 39.9 46.6 46.5 ... $ fuel_price : num 2.57 2.55 2.51 2.56 2.62 ... $ cpi : num 211 211 211 211 211 ... $ unemployment: num 8.11 8.11 8.11 8.11 8.11 ... $ weekly_sales: num 1643691 1641957 1611968 1409728 1554807 ...
#extracting the numerical columns from the dataset
num_features = sapply(walmart_data, is.numeric)
num_features = names(walmart_data)[num_features]
num_features
#plotting box plots for all numerical features to check for outliers in the data
plot_temp = ggplot(walmart_data, aes(x = temperature)) +
  geom_boxplot() +
  ggtitle("Box plot of temperature")
plot_fp = ggplot(walmart_data, aes(x = fuel_price)) +
  geom_boxplot() +
  ggtitle("Box plot of fuel_price")
plot_cpi = ggplot(walmart_data, aes(x = cpi)) +
  geom_boxplot() +
  ggtitle("Box plot of cpi")
plot_uemp = ggplot(walmart_data, aes(x = unemployment)) +
  geom_boxplot() +
  ggtitle("Box plot of unemployment")
plot_ws = ggplot(walmart_data, aes(x = weekly_sales)) +
  geom_boxplot() +
  ggtitle("Box plot of weekly_sales")
grid.arrange(plot_temp, plot_fp, plot_cpi, plot_uemp, plot_ws, ncol = 3, nrow = 2)
cat("\nInference:\nBefore removal of outliers, the dataset had", nrow(walmart_data), "samples.\n")
num_f = num_features[-length(num_features)]   # exclude the response, weekly_sales
for(i in num_f) {
  Q1 = quantile(walmart_data[[i]], 0.25, na.rm = TRUE)
  Q3 = quantile(walmart_data[[i]], 0.75, na.rm = TRUE)
  iqr = Q3 - Q1   # named iqr to avoid masking stats::IQR()
  lower_bound = Q1 - 1.5 * iqr
  upper_bound = Q3 + 1.5 * iqr
  walmart_data = subset(walmart_data, walmart_data[[i]] >= lower_bound & walmart_data[[i]] <= upper_bound)
}
cat("After removal of outliers, the dataset now has", nrow(walmart_data), "samples.\n")
Inference:
Before removal of outliers, the dataset had 6435 samples.
After removal of outliers, the dataset now has 5951 samples.
We removed outliers using the Interquartile Range (IQR) method: data points within the bounds Q1 - 1.5*IQR and Q3 + 1.5*IQR are kept; anything outside is classified as an outlier and excluded.
The original dataset contained 6435 samples; after removing outliers, 5951 samples remain, so 484 rows were dropped.
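The IQR rule described above can be illustrated on a toy vector (base R only; the numbers are made up for the example):

```r
# Toy illustration of the IQR outlier rule used on the Walmart features.
x = c(10, 12, 11, 13, 12, 95)                      # 95 is an obvious outlier
q = quantile(x, c(0.25, 0.75))                     # first and third quartiles
iqr_val = q[2] - q[1]
keep = x >= q[1] - 1.5 * iqr_val & x <= q[2] + 1.5 * iqr_val
x[keep]                                            # the outlier 95 is dropped
```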
# Plotting overall sales over time
sales_by_date = aggregate(weekly_sales ~ date, data = walmart_data, sum)
ggplot(sales_by_date,aes(x = date,y=weekly_sales))+
geom_line()+
ggtitle("Overall Weekly Sales Over Time")+
xlab("Date")+
ylab("Total Weekly Sales by date")
#plotting sales by each store
sales_by_store = aggregate(weekly_sales ~ store, data=walmart_data, sum)
ggplot(sales_by_store, aes(x=factor(store), y=weekly_sales)) +
geom_bar(stat="identity") +
ggtitle("Total Sales by Store")+
xlab("Stores")+
ylab("Total Sales")
#plotting boxplot of sales vs weeklysales
ggplot(walmart_data, aes(x=store, y=weekly_sales)) +
geom_boxplot() +
ggtitle("Distribution of Weekly Sales Across Stores")+
xlab("Store Number")+
ylab("Weekly Sales")
ggplot(walmart_data, aes(x=holiday_flag, y=weekly_sales, fill=holiday_flag)) +
geom_boxplot() +
ggtitle("Impact of Holidays on Weekly Sales")+
xlab("Holiday Flag")+
ylab("Weekly Sales")
# Sales vs. CPI
ggplot(walmart_data, aes(x=cpi, y=weekly_sales)) +
geom_point()+
geom_smooth(method='lm',se = F) +
ggtitle("Weekly Sales vs. CPI")+
xlab("Consumer Price Index (CPI)")+
ylab("Weekly Sales")
# Sales vs. Unemployment
ggplot(walmart_data, aes(x=unemployment, y=weekly_sales)) +
geom_point() +
geom_smooth(method='lm',se= F )+
ggtitle("Weekly Sales vs. Unemployment")+
xlab("Unemployment Rate")+
ylab("Weekly Sales")
#plotting the correlation matrix of the walmart dataset of numerical features
cor_matrix = cor(walmart_data[, num_features], use = "complete.obs")
col4 = colorRampPalette(c("black", "darkgrey", "grey", "#CFB87C"))
corrplot(cor_matrix, method="ellipse", col=col4(100), addCoef.col="black", tl.col="black")
#one-hot encoding for the factor variables(store and holiday_flag)
store_dummies = model.matrix(~ store, data = walmart_data)[,-1]
holiday_flag_dummies = model.matrix(~ holiday_flag-1, data = walmart_data)
store_dummies = as_tibble(store_dummies)
holiday_flag_dummies = as_tibble(holiday_flag_dummies)
holiday_flag_dummies = holiday_flag_dummies[2]
head(store_dummies)
head(holiday_flag_dummies)
| store2 | store3 | store4 | store5 | store6 | store7 | store8 | store9 | store10 | store11 | ⋯ | store36 | store37 | store38 | store39 | store40 | store41 | store42 | store43 | store44 | store45 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| holiday_flagholiday_week |
|---|
| <dbl> |
| 0 |
| 1 |
| 0 |
| 0 |
| 0 |
| 0 |
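The dummy columns above follow `model.matrix()`'s reference-level convention, which is why there is no store1 column. A minimal sketch on a toy factor (hypothetical variable `g`) shows the same behaviour:

```r
# A 3-level factor becomes 2 dummy columns; the first level is the dropped reference.
df = data.frame(g = factor(c("a", "b", "c", "b")))
mm = model.matrix(~ g, data = df)[, -1]   # drop the intercept column, as above
mm                                        # columns gb and gc; level "a" is all zeros
```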
#merging the walmart data with encoded variables(store and holiday_flag) and dropping the original variables
walmart_data_e = cbind(store_dummies, holiday_flag_dummies,walmart_data)
walmart_data_e$store = NULL
walmart_data_e$holiday_flag = NULL
head(walmart_data_e)
| | store2 | store3 | store4 | store5 | store6 | store7 | store8 | store9 | store10 | store11 | ⋯ | store43 | store44 | store45 | holiday_flagholiday_week | date | temperature | fuel_price | cpi | unemployment | weekly_sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <date> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2010-02-05 | 42.31 | 2.572 | 211.0964 | 8.106 | 1643691 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 1 | 2010-02-12 | 38.51 | 2.548 | 211.2422 | 8.106 | 1641957 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2010-02-19 | 39.93 | 2.514 | 211.2891 | 8.106 | 1611968 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2010-02-26 | 46.63 | 2.561 | 211.3196 | 8.106 | 1409728 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2010-03-05 | 46.50 | 2.625 | 211.3501 | 8.106 | 1554807 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2010-03-12 | 57.79 | 2.667 | 211.3806 | 8.106 | 1439542 |
#extracting the numerical columns of the train data
num_train_f = sapply(walmart_data_e, is.numeric)
num_train_f = names(walmart_data_e)[num_train_f]
num_train_f
Objective: To determine if there's a significant difference in average weekly sales during holiday weeks compared to non-holiday weeks.
To answer this, we first define the hypotheses.
Null Hypothesis (H0): There is no difference in average weekly sales between holiday weeks and non-holiday weeks.
$$\mu_{holiday} = \mu_{non-holiday}$$
Alternative Hypothesis (H1): There is a difference in average weekly sales between holiday weeks and non-holiday weeks.
$$\mu_{holiday} \neq \mu_{non-holiday}$$
#extracting the holiday sales and non-holidays from the walmart data
holiday_sales = walmart_data_e$weekly_sales[walmart_data_e$holiday_flagholiday_week == 1]
non_holiday_sales = walmart_data_e$weekly_sales[walmart_data_e$holiday_flagholiday_week == 0]
result = t.test(holiday_sales,non_holiday_sales)
cat("----------Result of Hypothesis :\n")
print(result)
if (result$p.value < 0.05)
{
cat("Reject the null hypothesis\n")
} else
{
cat("Fail to reject the null hypothesis\n")
}
----------Result of Hypothesis :

	Welch Two Sample t-test

data:  holiday_sales and non_holiday_sales
t = 2.5678, df = 468.6, p-value = 0.01054
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  19276.42 144960.90
sample estimates:
mean of x mean of y 
  1127361   1045243 

Reject the null hypothesis
Interpretation of the hypothesis test:
The p-value (0.01054) is below the 0.05 significance level, so we reject the null hypothesis: there is statistically significant evidence that average weekly sales differ between holiday and non-holiday weeks. The sample means (1,127,361 for holiday weeks versus 1,045,243 for non-holiday weeks) suggest that holiday weeks average higher sales.
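The same decision rule can be sketched on simulated data (a hedged toy example; the numbers below are not the Walmart figures):

```r
# Two groups with different true means should lead to rejecting H0.
set.seed(42)
a = rnorm(200, mean = 110, sd = 10)   # stand-in for holiday-week sales
b = rnorm(200, mean = 100, sd = 10)   # stand-in for non-holiday-week sales
res = t.test(a, b)                    # Welch two-sample t-test by default
res$p.value < 0.05                    # TRUE: reject the null hypothesis
```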
Objective : Estimating confidence intervals for mean weekly sales during Super Bowl, Labor Day, Thanksgiving, and Christmas.
superbowl_dates = as.Date(c("2010-02-12", "2011-02-11", "2012-02-10"))
laborday_dates = as.Date(c("2010-09-10", "2011-09-09", "2012-09-07"))
thanksgiving_dates = as.Date(c("2010-11-26", "2011-11-25", "2012-11-23"))
christmas_dates = as.Date(c("2010-12-31", "2011-12-30", "2012-12-28"))
#extracting the weekly sales for the above dates from walmart_data_e
superbowl_sales = walmart_data_e$weekly_sales[walmart_data_e$date %in% superbowl_dates]
laborday_sales = walmart_data_e$weekly_sales[walmart_data_e$date %in% laborday_dates]
thanksgiving_sales = walmart_data_e$weekly_sales[walmart_data_e$date %in% thanksgiving_dates]
christmas_sales = walmart_data_e$weekly_sales[walmart_data_e$date %in% christmas_dates]
#calculating the CI for each holiday's sales:
t_test_superbowl = t.test(superbowl_sales)
t_test_laborday = t.test(laborday_sales)
t_test_thanksgiving = t.test(thanksgiving_sales)
t_test_christmas = t.test(christmas_sales)
print("Confidence interval for Super Bowl :")
print(t_test_superbowl)
print("Confidence interval for Labor Day :")
print(t_test_laborday)
print("Confidence interval for Thanksgiving :")
print(t_test_thanksgiving)
print("Confidence interval for Christmas :")
print(t_test_christmas)
[1] "Confidence interval for Super Bowl :"

	One Sample t-test

data:  superbowl_sales
t = 20.497, df = 123, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  977755.2 1186785.5
sample estimates:
mean of x 
  1082270 

[1] "Confidence interval for Labor Day :"

	One Sample t-test

data:  laborday_sales
t = 21.626, df = 125, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  944327 1134579
sample estimates:
mean of x 
  1039453 

[1] "Confidence interval for Thanksgiving :"

	One Sample t-test

data:  thanksgiving_sales
t = 16.901, df = 83, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 1308771 1657895
sample estimates:
mean of x 
  1483333 

[1] "Confidence interval for Christmas :"

	One Sample t-test

data:  christmas_sales
t = 17.433, df = 83, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  859164.9 1080464.4
sample estimates:
mean of x 
 969814.7 
Interpretation of the confidence interval results:
Each interval is a range that, with 95% confidence, contains the true mean weekly sales for that holiday. Thanksgiving has the highest estimated mean (about 1,483,333; CI 1,308,771 to 1,657,895), while the Christmas-period weeks have the lowest (about 969,815), indicating that the sales lift varies considerably across holidays.
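As a sanity check on what these intervals mean, the 95% CI reported by `t.test()` matches the textbook formula mean ± t(0.975, n−1)·sd/√n (toy simulated data below, not the Walmart figures):

```r
# Hand-computed one-sample 95% CI versus t.test()'s interval.
set.seed(1)
x = rnorm(50, mean = 100, sd = 15)
m = mean(x); se = sd(x) / sqrt(length(x))
ci_manual = m + c(-1, 1) * qt(0.975, df = length(x) - 1) * se
ci_ttest = as.numeric(t.test(x)$conf.int)
all.equal(ci_manual, ci_ttest)        # TRUE
```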
Objective: Testing for differences in sales variability during holiday weeks versus non-holiday weeks
holiday_sales = walmart_data_e$weekly_sales[walmart_data_e$holiday_flagholiday_week == 1]
non_holiday_sales = walmart_data_e$weekly_sales[walmart_data_e$holiday_flagholiday_week == 0]
F_test_results = var.test(holiday_sales, non_holiday_sales, alternative = "two.sided")
F_test_results
F test to compare two variances
data: holiday_sales and non_holiday_sales
F = 1.2546, num df = 417, denom df = 5532, p-value = 0.001027
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.094406 1.450702
sample estimates:
ratio of variances
1.254588
Interpretation of F-test results:
The F-test results indicate that the variances are significantly different, as the p-value of 0.001027 is below the standard alpha value of 0.05.
The F-value is 1.2546, and the 95% confidence interval for the ratio of variances, 1.094 to 1.451, does not include 1, confirming that weekly sales are more variable during holiday weeks than during non-holiday weeks.
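The F statistic reported by `var.test()` is simply the ratio of the two sample variances, which a toy simulation makes explicit (hedged example with made-up data):

```r
# var.test()'s estimate equals var(a) / var(b).
set.seed(7)
a = rnorm(100, sd = 12)
b = rnorm(150, sd = 10)
res = var.test(a, b)
all.equal(unname(res$estimate), var(a) / var(b))   # TRUE
```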
Objective: To check whether there are significant differences in weekly sales among different Walmart store locations.
anova_result = aov(weekly_sales ~ store, data = walmart_data)
summary(anova_result)
              Df    Sum Sq   Mean Sq F value Pr(>F)    
store         44 1.788e+15 4.064e+13    1496 <2e-16 ***
Residuals   5906 1.605e+14 2.717e+10                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpretation of the ANOVA test:
The F-statistic is very large (1496) with a p-value < 2e-16, so we reject the null hypothesis of equal means: mean weekly sales differ significantly across the 45 store locations.
#Residual vs Fitted values for constant variance
plot(anova_result,1)
leveneTest(weekly_sales ~ store, data = walmart_data)
| | Df | F value | Pr(>F) |
|---|---|---|---|
| group | 44 | 17.76924 | 1.616619e-126 |
| | 5906 | NA | NA |
Interpretation for Constant Variance:
The residuals-vs-fitted plot and the Levene test indicate whether the ANOVA satisfies the constant-variance assumption.
The Levene test's p-value is far below the 0.05 significance level, giving substantial evidence that the variances differ across groups.
We therefore conclude that the homogeneity-of-variances assumption is violated for the store groups.
plot(anova_result,2)
shapiro_test = shapiro.test(sample(residuals(anova_result), 5000))
print(shapiro_test)
	Shapiro-Wilk normality test

data:  sample(residuals(anova_result), 5000)
W = 0.69415, p-value < 2.2e-16
Interpretation for Normality Assumption:
The Q-Q plot and the Shapiro-Wilk test indicate whether the ANOVA residuals are normally distributed.
The null hypothesis for this assumption is that the residuals are normally distributed.
The alternative hypothesis is that they are not.
Given the extremely small p-value, we reject the null hypothesis and infer that there is statistical evidence that the residuals are not normally distributed.
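The Shapiro-Wilk test behaves as expected on a clearly non-normal toy sample (hedged illustration with simulated data):

```r
# A strongly right-skewed sample is rejected by the Shapiro-Wilk test.
set.seed(9)
skewed = rexp(500)                         # exponential, far from normal
shapiro.test(skewed)$p.value < 0.05        # TRUE: normality rejected
```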
Since the normality and constant-variance assumptions are violated, we apply the Kruskal-Wallis test, a non-parametric alternative to one-way ANOVA that does not require normal residuals.
kruskal.test(weekly_sales ~ store, data = walmart_data)
	Kruskal-Wallis rank sum test

data:  weekly_sales by store
Kruskal-Wallis chi-squared = 5690.2, df = 44, p-value < 2.2e-16
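The result above (p < 2.2e-16) again indicates that weekly sales differ across stores, this time without normality assumptions. A toy sketch of the test on clearly separated groups (simulated data, not the Walmart figures):

```r
# Kruskal-Wallis detects shifted group locations in toy data.
set.seed(3)
g = factor(rep(c("A", "B", "C"), each = 40))
y = c(rnorm(40, mean = 10), rnorm(40, mean = 12), rnorm(40, mean = 15))
kruskal.test(y ~ g)$p.value < 0.05         # TRUE: group locations differ
```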
The encoded Walmart dataset has features with widely varying ranges. To address this, we normalise all numerical features of the encoded data to a standardised scale ranging from 0 to 1.
Feature scaling is necessary before regression modelling: putting the predictors on a comparable scale makes their estimated effects easier to compare.
# Distribution of weekly sales before feature scaling
ggplot()+
  geom_histogram(aes(x = walmart_data_e$weekly_sales, y = after_stat(density)), fill = 'green', bins = 30)+
  geom_density(aes(x = walmart_data_e$weekly_sales))+
  ggtitle("Distribution plot of target(weekly_sales) in walmart_data")
print(num_features)
[1] "temperature" "fuel_price" "cpi" "unemployment" "weekly_sales"
# feature scaling the numerical features between [0,1].
feature_scaling = function(x) {
fs = (x - min(x))/(max(x) - min(x))
return(fs)
}
walmart_data_e[num_features] = lapply(walmart_data_e[num_features], feature_scaling)
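A quick check of the min-max scaler defined above: the minimum maps to 0, the maximum to 1, and interior points scale linearly (toy numbers for illustration):

```r
# Same min-max scaler as defined above, applied to a toy vector.
feature_scaling = function(x) (x - min(x)) / (max(x) - min(x))
feature_scaling(c(3, 7, 11))               # 0.0 0.5 1.0
```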
# Distribution of weekly sales after feature scaling
ggplot()+
  geom_histogram(aes(x = walmart_data_e$weekly_sales, y = after_stat(density)), fill = 'green', bins = 30)+
  geom_density(aes(x = walmart_data_e$weekly_sales))+
  ggtitle("Distribution plot of target(weekly_sales) in walmart_data")
#randomly splitting the train and test from the encoded walmart data(below code from inclass notebook)
set.seed(123456)
n = floor(0.8 * nrow(walmart_data_e)) #find the number corresponding to 80% of the data
index = sample(seq_len(nrow(walmart_data_e)), size = n) #randomly sample indices to be included in the training set
train_walmart = walmart_data_e[index, ] #set the training set to be the randomly sampled rows of the dataframe
test_walmart = walmart_data_e[-index, ] #set the testing set to be the remaining rows
cat("There are", dim(train_walmart)[1], "rows and",dim(train_walmart)[2],"columns in the training set. ") #check the dimensions
cat("\nThere are", dim(test_walmart)[1], "rows and",dim(test_walmart)[2],"columns in the testing set.") #check the dimensions
There are 4760 rows and 51 columns in the training set. There are 1191 rows and 51 columns in the testing set.
head(train_walmart)
| | store2 | store3 | store4 | store5 | store6 | store7 | store8 | store9 | store10 | store11 | ⋯ | store43 | store44 | store45 | holiday_flagholiday_week | date | temperature | fuel_price | cpi | unemployment | weekly_sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <date> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 3320 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2010-09-03 | 0.7127751 | 0.1943888 | 0.06583886 | 0.5755515 | 0.32044650 |
| 234 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2011-10-28 | 0.6302331 | 0.4509018 | 0.90206839 | 0.4734059 | 0.43209741 |
| 1084 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2011-09-02 | 0.8341606 | 0.5315631 | 0.92057053 | 0.3198852 | 0.18882170 |
| 5122 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2012-04-27 | 0.6754424 | 0.7094188 | 0.93148720 | 0.4051073 | 0.02726842 |
| 2430 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2012-10-19 | 0.4391454 | 0.6558116 | 0.05027209 | 0.1841946 | 0.20710243 |
| 2935 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2011-07-08 | 0.8815278 | 0.5050100 | 0.87838957 | 0.5355092 | 0.14523599 |
head(test_walmart)
| | store2 | store3 | store4 | store5 | store6 | store7 | store8 | store9 | store10 | store11 | ⋯ | store43 | store44 | store45 | holiday_flagholiday_week | date | temperature | fuel_price | cpi | unemployment | weekly_sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <date> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2010-02-26 | 0.4226370 | 0.04458918 | 0.8427068 | 0.5738894 | 0.3324580 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2010-03-19 | 0.5084161 | 0.12424850 | 0.8416788 | 0.5738894 | 0.3498571 |
| 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2010-04-16 | 0.6350885 | 0.16833667 | 0.8344934 | 0.5288607 | 0.3480677 |
| 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2010-05-07 | 0.7023090 | 0.18186373 | 0.8330233 | 0.5288607 | 0.3862800 |
| 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2010-05-21 | 0.7442814 | 0.17735471 | 0.8357625 | 0.5288607 | 0.3296688 |
| 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 2010-05-28 | 0.7874407 | 0.14378758 | 0.8385268 | 0.5288607 | 0.3386493 |
Objective: To determine the impact of factors such as store location, holiday weeks, temperature, fuel prices, CPI, and unemployment on weekly sales.
Regression modelling can be used to determine the factors that influence weekly sales: it provides p-values and coefficient estimates that indicate each factor's impact on weekly sales.
In this regression model, the response variable is weekly sales, while the remaining features are the predictors. To understand these dependencies we will use the lm() function.
We will use the train data for modelling purposes and the test data for predicting and evaluating the performance of our model on unseen data to assess its generalisation ability.
#calculating the MSPE for a model using the test data
cal_mspe = function(model, test_data)
{
  y_pred_sales = predict(model, test_data)
  spe = (test_data$weekly_sales - y_pred_sales)^2
  mspe = mean(spe)
  return(mspe)
}
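The MSPE pattern used here can be sanity-checked on toy simulated data (hedged example; the variable names below are made up): fit on one sample, score on a held-out sample, and the MSPE should land near the noise variance.

```r
# Toy check: with noise sd 0.1, the held-out MSPE should be close to 0.01.
set.seed(11)
train = data.frame(x = runif(100)); train$y = 2 * train$x + rnorm(100, sd = 0.1)
test  = data.frame(x = runif(50));  test$y  = 2 * test$x  + rnorm(50,  sd = 0.1)
fit = lm(y ~ x, data = train)
mspe = mean((test$y - predict(fit, test))^2)
mspe < 0.05                                # TRUE: near the noise variance
```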
#multiple Linear regression full model
f_model = lm(weekly_sales~.-date,data = train_walmart)
cat("FULL MODEL SUMMARY")
summary(f_model)
mspe_fm = cal_mspe(f_model,test_walmart)
cat("--------------MSPE of test data of full model: ",mspe_fm)
FULL MODEL SUMMARY
Call:
lm(formula = weekly_sales ~ . - date, data = train_walmart)
Residuals:
Min 1Q Median 3Q Max
-0.14435 -0.01871 -0.00343 0.01057 0.51594
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.355963 0.031876 11.167 < 2e-16 ***
store2 0.099481 0.005563 17.884 < 2e-16 ***
store3 -0.324251 0.005633 -57.558 < 2e-16 ***
store4 0.187886 0.030396 6.181 6.89e-10 ***
store5 -0.352420 0.005968 -59.055 < 2e-16 ***
store6 -0.008329 0.005765 -1.445 0.148599
store7 -0.256685 0.008908 -28.815 < 2e-16 ***
store8 -0.192300 0.005941 -32.368 < 2e-16 ***
store9 -0.291761 0.005987 -48.732 < 2e-16 ***
store10 0.158865 0.029006 5.477 4.55e-08 ***
store11 -0.058466 0.005653 -10.343 < 2e-16 ***
store12 -0.074143 0.029241 -2.536 0.011258 *
store13 0.174108 0.029732 5.856 5.06e-09 ***
store14 0.148789 0.010896 13.655 < 2e-16 ***
store15 -0.201128 0.027231 -7.386 1.78e-13 ***
store16 -0.285118 0.010159 -28.066 < 2e-16 ***
store17 -0.138739 0.030090 -4.611 4.12e-06 ***
store18 -0.068532 0.026475 -2.589 0.009668 **
store19 0.024478 0.027282 0.897 0.369645
store20 0.160867 0.006475 24.843 < 2e-16 ***
store21 -0.219514 0.005690 -38.581 < 2e-16 ***
store22 -0.097891 0.025792 -3.795 0.000149 ***
store23 -0.006525 0.029887 -0.218 0.827188
store24 0.004407 0.026991 0.163 0.870292
store25 -0.231449 0.006378 -36.288 < 2e-16 ***
store26 -0.100833 0.027176 -3.710 0.000209 ***
store27 0.108922 0.026005 4.188 2.86e-05 ***
store28 0.005826 0.029833 0.195 0.845174
store29 -0.212900 0.025827 -8.243 < 2e-16 ***
store30 -0.308294 0.005552 -55.523 < 2e-16 ***
store31 -0.043153 0.005599 -7.707 1.56e-14 ***
store32 -0.091378 0.008811 -10.370 < 2e-16 ***
store33 -0.291358 0.028927 -10.072 < 2e-16 ***
store34 -0.091080 0.027448 -3.318 0.000913 ***
store35 -0.118392 0.025187 -4.700 2.67e-06 ***
store36 -0.323193 0.005560 -58.126 < 2e-16 ***
store37 -0.283808 0.005745 -49.402 < 2e-16 ***
store38 -0.222887 0.029391 -7.584 4.02e-14 ***
store39 -0.027923 0.005696 -4.903 9.78e-07 ***
store40 -0.134767 0.029891 -4.509 6.68e-06 ***
store41 -0.074629 0.009733 -7.667 2.12e-14 ***
store42 -0.210376 0.029050 -7.242 5.14e-13 ***
store43 -0.233584 0.006564 -35.587 < 2e-16 ***
store44 -0.298257 0.029956 -9.956 < 2e-16 ***
store45 -0.187594 0.010837 -17.310 < 2e-16 ***
holiday_flagholiday_week 0.020612 0.002492 8.271 < 2e-16 ***
temperature -0.016378 0.003829 -4.278 1.93e-05 ***
fuel_price -0.022147 0.004411 -5.021 5.33e-07 ***
cpi 0.064854 0.033062 1.962 0.049873 *
unemployment -0.047097 0.010535 -4.470 7.99e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0428 on 4710 degrees of freedom
Multiple R-squared: 0.9258, Adjusted R-squared: 0.925
F-statistic: 1199 on 49 and 4710 DF, p-value: < 2.2e-16
--------------MSPE of test data of full model: 0.00280065
In the detailed model results above, certain p-values exceed the significance threshold (alpha = 0.05); those features are not statistically significant predictors of weekly sales (note that the date feature was excluded from this model).
We eliminate the feature with the highest p-value one at a time and evaluate the resulting $R^2$, adjusted $R^2$, and MSPE values.
If removing a feature causes a substantial change in those metrics, we may choose to keep it even if its p-value is high.
Based on the outcome, it is evident that store 24 has the highest p-value. To address this, we will utilise the update function to eliminate store 24 and examine the revised model.
Feature Selection:
model1 = update(f_model,.~. -store24)
summary(model1)
cat("\n-------- MSPE after removing store24",cal_mspe(model1,test_walmart))
Call:
lm(formula = weekly_sales ~ store2 + store3 + store4 + store5 +
store6 + store7 + store8 + store9 + store10 + store11 + store12 +
store13 + store14 + store15 + store16 + store17 + store18 +
store19 + store20 + store21 + store22 + store23 + store25 +
store26 + store27 + store28 + store29 + store30 + store31 +
store32 + store33 + store34 + store35 + store36 + store37 +
store38 + store39 + store40 + store41 + store42 + store43 +
store44 + store45 + holiday_flagholiday_week + temperature +
fuel_price + cpi + unemployment, data = train_walmart)
Residuals:
Min 1Q Median 3Q Max
-0.14435 -0.01870 -0.00342 0.01060 0.51600
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.360971 0.008695 41.514 < 2e-16 ***
store2 0.099367 0.005517 18.010 < 2e-16 ***
store3 -0.324221 0.005630 -57.590 < 2e-16 ***
store4 0.183065 0.007207 25.402 < 2e-16 ***
store5 -0.352639 0.005815 -60.647 < 2e-16 ***
store6 -0.008459 0.005709 -1.482 0.138503
store7 -0.257847 0.005358 -48.121 < 2e-16 ***
store8 -0.192404 0.005906 -32.578 < 2e-16 ***
store9 -0.291852 0.005961 -48.962 < 2e-16 ***
store10 0.154228 0.005906 26.113 < 2e-16 ***
store11 -0.058431 0.005648 -10.346 < 2e-16 ***
store12 -0.078467 0.012400 -6.328 2.72e-10 ***
store13 0.169369 0.006485 26.119 < 2e-16 ***
store14 0.147217 0.005101 28.861 < 2e-16 ***
store15 -0.205475 0.005743 -35.778 < 2e-16 ***
store16 -0.286525 0.005387 -53.189 < 2e-16 ***
store17 -0.143529 0.006708 -21.397 < 2e-16 ***
store18 -0.072759 0.005560 -13.085 < 2e-16 ***
store19 0.020124 0.005787 3.477 0.000511 ***
store20 0.160330 0.005577 28.750 < 2e-16 ***
store21 -0.219631 0.005644 -38.916 < 2e-16 ***
store22 -0.102000 0.005645 -18.068 < 2e-16 ***
store23 -0.011217 0.008238 -1.362 0.173402
store25 -0.231999 0.005416 -42.836 < 2e-16 ***
store26 -0.105170 0.005762 -18.252 < 2e-16 ***
store27 0.104779 0.005688 18.421 < 2e-16 ***
store28 0.001486 0.013550 0.110 0.912675
store29 -0.217011 0.005764 -37.652 < 2e-16 ***
store30 -0.308413 0.005503 -56.041 < 2e-16 ***
store31 -0.043270 0.005553 -7.792 8.03e-15 ***
store32 -0.092534 0.005248 -17.632 < 2e-16 ***
store33 -0.295981 0.005946 -49.780 < 2e-16 ***
store34 -0.095457 0.005910 -16.152 < 2e-16 ***
store35 -0.122410 0.005387 -22.724 < 2e-16 ***
store36 -0.323329 0.005497 -58.821 < 2e-16 ***
store37 -0.283943 0.005684 -49.953 < 2e-16 ***
store38 -0.227240 0.012386 -18.346 < 2e-16 ***
store39 -0.028063 0.005631 -4.984 6.46e-07 ***
store40 -0.139459 0.008232 -16.941 < 2e-16 ***
store41 -0.075978 0.005148 -14.759 < 2e-16 ***
store42 -0.215019 0.005947 -36.157 < 2e-16 ***
store43 -0.233834 0.006382 -36.639 < 2e-16 ***
store44 -0.303025 0.006688 -45.306 < 2e-16 ***
store45 -0.189166 0.004984 -37.955 < 2e-16 ***
holiday_flagholiday_week 0.020635 0.002488 8.295 < 2e-16 ***
temperature -0.016396 0.003827 -4.284 1.87e-05 ***
fuel_price -0.021696 0.003437 -6.312 3.01e-10 ***
cpi 0.059585 0.007221 8.252 < 2e-16 ***
unemployment -0.047865 0.009426 -5.078 3.96e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0428 on 4711 degrees of freedom
Multiple R-squared: 0.9258, Adjusted R-squared: 0.925
F-statistic: 1224 on 48 and 4711 DF, p-value: < 2.2e-16
-------- MSPE after removing store24 0.00280063
Based on the output above, removing store24 produced no noticeable change in $R^2$, $adjR^2$, or MSPE, so we continue examining the remaining features whose p-values exceed alpha.
From the result above, store28 now has the highest p-value relative to the significance level, so it is removed in the next step.
model2 = update(model1,.~. -store28)
summary(model2)
cat("\n-------- MSPE after removing store28",cal_mspe(model2,test_walmart))
Call:
lm(formula = weekly_sales ~ store2 + store3 + store4 + store5 +
store6 + store7 + store8 + store9 + store10 + store11 + store12 +
store13 + store14 + store15 + store16 + store17 + store18 +
store19 + store20 + store21 + store22 + store23 + store25 +
store26 + store27 + store29 + store30 + store31 + store32 +
store33 + store34 + store35 + store36 + store37 + store38 +
store39 + store40 + store41 + store42 + store43 + store44 +
store45 + holiday_flagholiday_week + temperature + fuel_price +
cpi + unemployment, data = train_walmart)
Residuals:
Min 1Q Median 3Q Max
-0.14437 -0.01867 -0.00342 0.01060 0.51599
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.360897 0.008668 41.634 < 2e-16 ***
store2 0.099363 0.005517 18.011 < 2e-16 ***
store3 -0.324203 0.005627 -57.616 < 2e-16 ***
store4 0.183022 0.007195 25.436 < 2e-16 ***
store5 -0.352592 0.005798 -60.812 < 2e-16 ***
store6 -0.008421 0.005698 -1.478 0.139508
store7 -0.257900 0.005335 -48.338 < 2e-16 ***
store8 -0.192344 0.005879 -32.715 < 2e-16 ***
store9 -0.291794 0.005937 -49.149 < 2e-16 ***
store10 0.154085 0.005760 26.749 < 2e-16 ***
store11 -0.058413 0.005645 -10.348 < 2e-16 ***
store12 -0.078720 0.012182 -6.462 1.14e-10 ***
store13 0.169294 0.006448 26.256 < 2e-16 ***
store14 0.147140 0.005052 29.125 < 2e-16 ***
store15 -0.205589 0.005648 -36.401 < 2e-16 ***
store16 -0.286501 0.005382 -53.232 < 2e-16 ***
store17 -0.143583 0.006689 -21.466 < 2e-16 ***
store18 -0.072900 0.005410 -13.476 < 2e-16 ***
store19 0.020010 0.005692 3.515 0.000443 ***
store20 0.160326 0.005576 28.753 < 2e-16 ***
store21 -0.219635 0.005643 -38.922 < 2e-16 ***
store22 -0.102109 0.005557 -18.374 < 2e-16 ***
store23 -0.011209 0.008237 -1.361 0.173647
store25 -0.232001 0.005415 -42.841 < 2e-16 ***
store26 -0.105272 0.005687 -18.510 < 2e-16 ***
store27 0.104667 0.005595 18.707 < 2e-16 ***
store29 -0.217188 0.005532 -39.261 < 2e-16 ***
store30 -0.308418 0.005503 -56.049 < 2e-16 ***
store31 -0.043273 0.005552 -7.794 7.94e-15 ***
store32 -0.092591 0.005221 -17.734 < 2e-16 ***
store33 -0.296130 0.005786 -51.181 < 2e-16 ***
store34 -0.095639 0.005671 -16.864 < 2e-16 ***
store35 -0.122544 0.005244 -23.367 < 2e-16 ***
store36 -0.323345 0.005494 -58.849 < 2e-16 ***
store37 -0.283957 0.005682 -49.973 < 2e-16 ***
store38 -0.227492 0.012170 -18.693 < 2e-16 ***
store39 -0.028077 0.005629 -4.988 6.31e-07 ***
store40 -0.139451 0.008231 -16.942 < 2e-16 ***
store41 -0.075974 0.005147 -14.760 < 2e-16 ***
store42 -0.215162 0.005803 -37.080 < 2e-16 ***
store43 -0.233933 0.006318 -37.028 < 2e-16 ***
store44 -0.303089 0.006662 -45.493 < 2e-16 ***
store45 -0.189243 0.004933 -38.359 < 2e-16 ***
holiday_flagholiday_week 0.020642 0.002487 8.302 < 2e-16 ***
temperature -0.016359 0.003812 -4.291 1.81e-05 ***
fuel_price -0.021620 0.003367 -6.421 1.49e-10 ***
cpi 0.059472 0.007146 8.322 < 2e-16 ***
unemployment -0.047617 0.009150 -5.204 2.03e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0428 on 4712 degrees of freedom
Multiple R-squared: 0.9258, Adjusted R-squared: 0.9251
F-statistic: 1251 on 47 and 4712 DF, p-value: < 2.2e-16
-------- MSPE after removing store28 0.002800619
Removing store28 likewise left $R^2$, $adjR^2$, and MSPE essentially unchanged. Among the remaining features, store23 now has the highest p-value, so we remove it next.
model3 = update(model2,.~. -store23)
summary(model3)
cat("\n-------- MSPE after removing store23",cal_mspe(model3,test_walmart))
Call:
lm(formula = weekly_sales ~ store2 + store3 + store4 + store5 +
store6 + store7 + store8 + store9 + store10 + store11 + store12 +
store13 + store14 + store15 + store16 + store17 + store18 +
store19 + store20 + store21 + store22 + store25 + store26 +
store27 + store29 + store30 + store31 + store32 + store33 +
store34 + store35 + store36 + store37 + store38 + store39 +
store40 + store41 + store42 + store43 + store44 + store45 +
holiday_flagholiday_week + temperature + fuel_price + cpi +
unemployment, data = train_walmart)
Residuals:
Min 1Q Median 3Q Max
-0.14474 -0.01862 -0.00323 0.01047 0.51551
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.351558 0.005296 66.388 < 2e-16 ***
store2 0.099193 0.005516 17.983 < 2e-16 ***
store3 -0.323984 0.005625 -57.596 < 2e-16 ***
store4 0.189343 0.005496 34.454 < 2e-16 ***
store5 -0.351128 0.005698 -61.624 < 2e-16 ***
store6 -0.007380 0.005647 -1.307 0.191
store7 -0.258114 0.005334 -48.395 < 2e-16 ***
store8 -0.190762 0.005764 -33.097 < 2e-16 ***
store9 -0.290270 0.005831 -49.781 < 2e-16 ***
store10 0.157372 0.005230 30.090 < 2e-16 ***
store11 -0.058186 0.005643 -10.311 < 2e-16 ***
store12 -0.078943 0.012182 -6.480 1.01e-10 ***
store13 0.174485 0.005199 33.562 < 2e-16 ***
store14 0.147126 0.005052 29.119 < 2e-16 ***
store15 -0.202181 0.005063 -39.936 < 2e-16 ***
store16 -0.284047 0.005071 -56.009 < 2e-16 ***
store17 -0.137804 0.005168 -26.666 < 2e-16 ***
store18 -0.070452 0.005102 -13.808 < 2e-16 ***
store19 0.023430 0.005108 4.587 4.61e-06 ***
store20 0.160699 0.005570 28.852 < 2e-16 ***
store21 -0.219809 0.005642 -38.959 < 2e-16 ***
store22 -0.098901 0.005033 -19.651 < 2e-16 ***
store25 -0.231608 0.005408 -42.826 < 2e-16 ***
store26 -0.101610 0.005011 -20.278 < 2e-16 ***
store27 0.107849 0.005083 21.216 < 2e-16 ***
store29 -0.215953 0.005457 -39.571 < 2e-16 ***
store30 -0.308610 0.005501 -56.098 < 2e-16 ***
store31 -0.043436 0.005551 -7.824 6.25e-15 ***
store32 -0.092798 0.005219 -17.780 < 2e-16 ***
store33 -0.293017 0.005315 -55.132 < 2e-16 ***
store34 -0.094085 0.005555 -16.936 < 2e-16 ***
store35 -0.120218 0.004958 -24.246 < 2e-16 ***
store36 -0.323792 0.005485 -59.031 < 2e-16 ***
store37 -0.284365 0.005675 -50.110 < 2e-16 ***
store38 -0.227653 0.012171 -18.705 < 2e-16 ***
store39 -0.028519 0.005620 -5.075 4.03e-07 ***
store40 -0.132058 0.006183 -21.357 < 2e-16 ***
store41 -0.074154 0.004971 -14.917 < 2e-16 ***
store42 -0.211851 0.005268 -40.212 < 2e-16 ***
store43 -0.236613 0.006003 -39.413 < 2e-16 ***
store44 -0.297534 0.005266 -56.502 < 2e-16 ***
store45 -0.189265 0.004934 -38.360 < 2e-16 ***
holiday_flagholiday_week 0.020663 0.002487 8.309 < 2e-16 ***
temperature -0.016331 0.003812 -4.284 1.87e-05 ***
fuel_price -0.020248 0.003213 -6.302 3.20e-10 ***
cpi 0.064860 0.005950 10.901 < 2e-16 ***
unemployment -0.039240 0.006770 -5.796 7.22e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0428 on 4713 degrees of freedom
Multiple R-squared: 0.9258, Adjusted R-squared: 0.925
F-statistic: 1278 on 46 and 4713 DF, p-value: < 2.2e-16
-------- MSPE after removing store23 0.002806772
After removing store23, store6 is the only remaining feature with a p-value above alpha, so it is removed next.
model4 = update(model3,.~. -store6)
summary(model4)
cat("\n-------- MSPE after removing store6",cal_mspe(model4,test_walmart))
Call:
lm(formula = weekly_sales ~ store2 + store3 + store4 + store5 +
store7 + store8 + store9 + store10 + store11 + store12 +
store13 + store14 + store15 + store16 + store17 + store18 +
store19 + store20 + store21 + store22 + store25 + store26 +
store27 + store29 + store30 + store31 + store32 + store33 +
store34 + store35 + store36 + store37 + store38 + store39 +
store40 + store41 + store42 + store43 + store44 + store45 +
holiday_flagholiday_week + temperature + fuel_price + cpi +
unemployment, data = train_walmart)
Residuals:
Min 1Q Median 3Q Max
-0.14481 -0.01869 -0.00332 0.01050 0.51544
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.350998 0.005279 66.495 < 2e-16 ***
store2 0.102680 0.004828 21.267 < 2e-16 ***
store3 -0.320253 0.004847 -66.073 < 2e-16 ***
store4 0.189449 0.005495 34.475 < 2e-16 ***
store5 -0.347363 0.004917 -70.652 < 2e-16 ***
store7 -0.255755 0.005019 -50.957 < 2e-16 ***
store8 -0.186858 0.004930 -37.906 < 2e-16 ***
store9 -0.286360 0.005005 -57.212 < 2e-16 ***
store10 0.156972 0.005221 30.063 < 2e-16 ***
store11 -0.054446 0.004863 -11.195 < 2e-16 ***
store12 -0.079825 0.012164 -6.562 5.87e-11 ***
store13 0.174407 0.005199 33.546 < 2e-16 ***
store14 0.149123 0.004816 30.963 < 2e-16 ***
store15 -0.202269 0.005062 -39.954 < 2e-16 ***
store16 -0.281309 0.004619 -60.903 < 2e-16 ***
store17 -0.137804 0.005168 -26.664 < 2e-16 ***
store18 -0.070644 0.005100 -13.850 < 2e-16 ***
store19 0.023337 0.005108 4.569 5.03e-06 ***
store20 0.163865 0.005016 32.669 < 2e-16 ***
store21 -0.216325 0.004973 -43.501 < 2e-16 ***
store22 -0.098789 0.005033 -19.630 < 2e-16 ***
store25 -0.228454 0.004840 -47.199 < 2e-16 ***
store26 -0.101639 0.005011 -20.282 < 2e-16 ***
store27 0.107923 0.005083 21.231 < 2e-16 ***
store29 -0.216308 0.005451 -39.682 < 2e-16 ***
store30 -0.305133 0.004816 -63.363 < 2e-16 ***
store31 -0.039950 0.004869 -8.205 2.95e-16 ***
store32 -0.090421 0.004892 -18.482 < 2e-16 ***
store33 -0.293438 0.005305 -55.309 < 2e-16 ***
store34 -0.094643 0.005539 -17.085 < 2e-16 ***
store35 -0.120217 0.004959 -24.244 < 2e-16 ***
store36 -0.320391 0.004829 -66.347 < 2e-16 ***
store37 -0.280954 0.005039 -55.753 < 2e-16 ***
store38 -0.228550 0.012152 -18.807 < 2e-16 ***
store39 -0.025122 0.004983 -5.042 4.79e-07 ***
store40 -0.131577 0.006173 -21.315 < 2e-16 ***
store41 -0.071504 0.004539 -15.753 < 2e-16 ***
store42 -0.212251 0.005260 -40.353 < 2e-16 ***
store43 -0.233858 0.005622 -41.599 < 2e-16 ***
store44 -0.297558 0.005266 -56.502 < 2e-16 ***
store45 -0.187268 0.004692 -39.914 < 2e-16 ***
holiday_flagholiday_week 0.020678 0.002487 8.315 < 2e-16 ***
temperature -0.016430 0.003812 -4.310 1.66e-05 ***
fuel_price -0.019487 0.003160 -6.167 7.53e-10 ***
cpi 0.060641 0.004999 12.131 < 2e-16 ***
unemployment -0.038065 0.006710 -5.673 1.49e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0428 on 4714 degrees of freedom
Multiple R-squared: 0.9257, Adjusted R-squared: 0.925
F-statistic: 1306 on 45 and 4714 DF, p-value: < 2.2e-16
-------- MSPE after removing store6 0.002804798
After removing the store6 column, there is no change in $R^2$, $adjR^2$, or MSPE, and no remaining variable has a p-value above alpha. This model can therefore be considered optimal under the alpha-value approach.
The features store24, store28, store23, and store6 are not significant predictors of weekly sales.
Interpretation of Regression model results:
The regression results assign each feature a distinct coefficient; the coefficient of a predictor gives the change in weekly sales associated with a one-unit change in that predictor, holding the other predictors fixed.
First, consider the store indicators. Each store has a distinct coefficient, some positive and some negative: positive values indicate stores that perform better than the baseline store (store1), while negative values indicate stores that perform worse.
The positive coefficient for holiday_flag (0.0206) suggests that, on average, stores experience a sales increase during holiday weeks. The precise impact of holidays on individual stores could be estimated by adding interaction terms between the store indicators and holiday_flag.
The negative coefficients for temperature (-0.016), fuel_price (-0.019), and unemployment (-0.038) indicate that an increase in any of these factors is associated with a decrease in weekly sales, while the positive coefficient for CPI (0.061) indicates that an increase in CPI is associated with an increase in weekly sales.
Additionally, the high $R^2$ (0.9257) and $adjR^2$ (0.925) indicate that the model captures a large share of the variance in weekly sales, showing a strong fit.
Using this model to forecast weekly sales on unseen data (the test set) yields an MSPE of 0.0028, indicating that the model generalises well.
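The effects discussed above can be read straight off the fitted object; a brief sketch using model4 from above (the coefficient values quoted in the comments are those reported in the summary):

```r
# Reading individual effects from the fitted model (model4 from above):
b = coef(model4)
b["holiday_flagholiday_week"]  # ~ +0.0206: average holiday-week lift in scaled sales
b["temperature"]               # ~ -0.0164: sales fall as temperature rises
b["cpi"]                       # ~ +0.0606: sales rise with CPI
```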
Objective: Using diagnostic plots to evaluate the assumptions of the regression model built to predict weekly sales at Walmart, and to identify any anomalies that might affect the model's accuracy.
Goodness of Fit:
Feature Selection:
In the regression modelling, we performed feature selection using the alpha-value criterion, keeping the features most significant for predicting weekly sales: all features except store6, store23, store24, and store28.
Those four features have p-values above the significance level alpha (0.05), so we excluded them. After eliminating them, we obtained an optimal model that accounts for 92.5% of the variability in weekly sales.
Residual Analysis:
As we develop a multiple linear regression (MLR) model, we need to ensure that the final model adheres to the MLR assumptions. To do this, we perform residual analysis, checking the normality, linearity, constant variance, and independence of the residuals using diagnostic plots.
We will use the following diagnostic plots: Observed vs Fitted, Residuals vs Fitted, Q-Q, and Residuals vs Index.
#extracting the fitted, residuals of the models
y_train_sales_predict = fitted(model4)
residuals = resid(model4)
y_true_sales = train_walmart$weekly_sales
#observed vs fitted plot
ggplot()+
geom_point(aes(x = y_true_sales,y = y_train_sales_predict))+
geom_smooth(aes(x = y_true_sales,y = y_train_sales_predict),se = F,color = 'blue')+
geom_abline(intercept = 0,slope=1,color='red')+
ggtitle("Observed vs Fitted")+
xlab("Observed")+
ylab("Fitted")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The Observed vs Fitted plot assesses the linearity assumption of our model. In the plot above, the blue curve represents the trend, while the red line represents the ideal fit. The blue curve diverges from the red line at the extremes: the model is accurate in the middle range but less so at low and high values, which suggests the need for a transformation or a non-linear model.
To improve performance, we may need to transform the target variable.
The model adheres to the linearity assumption within the middle range, but deviates from it at the extremes.
#Residual vs fitted
ggplot()+
geom_point(aes(x= y_train_sales_predict,y=residuals))+
geom_smooth(aes(x= y_train_sales_predict,y=residuals),se=F,color='red')+
ggtitle("Residual vs fitted")+
xlab("Fitted values")+
ylab("Residuals")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The Residuals vs Fitted plot assesses the constant-variance (homoscedasticity) assumption of the MLR model. If the model satisfies this assumption, the points appear randomly scattered around the horizontal axis with no discernible pattern.
The plot above shows a pattern in which the residuals spread out as the fitted values increase, suggesting that the variance of the residuals is not constant.
The model violates the constant variance assumption.
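The visual impression from the plot can be complemented with a formal check. One standard option, not used elsewhere in this analysis, is the Breusch-Pagan test from the lmtest package (assuming that package is installed):

```r
# Hypothetical complementary check (not part of the original analysis):
# the Breusch-Pagan test regresses squared residuals on the predictors;
# a small p-value is evidence against constant variance.
library(lmtest)
bptest(model4)
```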
#Q-Q plot
s = (residuals - mean(residuals))/sd(residuals)
ggplot()+
stat_qq(aes(sample = s))+
geom_abline(slope = 1, intercept = 0, col = "blue")+
ggtitle("Q-Q plot")
The Q-Q plot lets us determine whether the model satisfies the normality assumption. If the residuals (the points in the plot) align closely with the blue 45-degree line, we can conclude that the MLR model does not violate normality.
The plot above shows that the residuals do not follow the straight line, which suggests that the MLR model does not meet the normality assumption.
The model violates the normality assumption.
#plotting residual vs index for independence as well.
x = 1:length(y_true_sales)
ggplot() +
geom_point(aes(x = x, y = residuals),alpha = 0.5) +
geom_abline(slope = 0, intercept = 0,color='blue')+
xlab("Index") +
geom_smooth(aes(x = x, y = residuals),se = F) +
ylab("Residuals") +
ggtitle("Residual vs index plot")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The independence assumption of our model is satisfied when the residuals are randomly distributed around zero in the Residuals vs Index plot. If the errors are randomly distributed, there is no correlation between them.
The plot above shows that the residuals of the MLR model are randomly distributed around zero, so we can confirm that the model satisfies the independence assumption.
The model follows the independence assumption.
Objective: To compare different regression models and select the one that provides the best predictive accuracy.
The regression model above was built using the statistical-significance (alpha-value) criterion: features with p-values greater than alpha were eliminated.
Alternatively, the regsubsets() function in R finds the optimal combination of features for each model size, from 1 up to the total number of predictors, according to a model-selection statistic; within each size, the best model is the one minimising the residual sum of squares (RSS). The search method can be backward selection, forward selection, or exhaustive search. The fitted object provides a table of logical values (TRUE/FALSE) indicating, for each model size, whether each feature is included.
colnames(train_walmart)[-51]
length(colnames(train_walmart))
n_feature = length(colnames(train_walmart)) - 1
models = regsubsets(weekly_sales~.-date,method = "backward", data = train_walmart,nvmax = n_feature)
summary_models = summary(models)
summary_df = summary_models$which
summary_df
| (Intercept) | store2 | store3 | store4 | store5 | store6 | store7 | store8 | store9 | store10 | ⋯ | store41 | store42 | store43 | store44 | store45 | holiday_flagholiday_week | temperature | fuel_price | cpi | unemployment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | ⋯ | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 2 | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | ⋯ | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 3 | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | ⋯ | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 4 | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | ⋯ | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 5 | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | ⋯ | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 6 | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | ⋯ | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 7 | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | ⋯ | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 8 | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | ⋯ | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 9 | TRUE | TRUE | FALSE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | ⋯ | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 10 | TRUE | TRUE | FALSE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | ⋯ | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 11 | TRUE | TRUE | FALSE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | ⋯ | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 12 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | ⋯ | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 13 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | ⋯ | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 14 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | ⋯ | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 15 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | ⋯ | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 16 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | ⋯ | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 17 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | TRUE | ⋯ | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 18 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | TRUE | ⋯ | FALSE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 19 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE | ⋯ | FALSE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 20 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE | ⋯ | FALSE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 21 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 22 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 23 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 24 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 25 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 26 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 27 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 28 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 29 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 30 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 31 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 32 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 33 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 34 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 35 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 36 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 37 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE |
| 38 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | FALSE |
| 39 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | FALSE |
| 40 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | FALSE |
| 41 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | TRUE |
| 42 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE |
| 43 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE |
| 44 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE |
| 45 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | ⋯ | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |
| 46 | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | ⋯ | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |
| 47 | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | ⋯ | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |
| 48 | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | ⋯ | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |
| 49 | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | ⋯ | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |
#finding out the best model based on bic, adjr2 and r2criteria
best_bic_model = which.min(summary_models$bic)
cat("The model which is best according to BIC:",best_bic_model)
best_adjr2_model = which.max(summary_models$adjr2)
cat("\nThe model which is best according to AdjR2:",best_adjr2_model)
best_r2_model = which.max(summary_models$rsq)
cat("\nThe model which is best according to R2:",best_r2_model)
The model which is best according to BIC: 45 The model which is best according to AdjR2: 47 The model which is best according to R2: 49
#selecting the best model based on aic criteria
n = nrow(train_walmart)
AIC = 2*(2:50) + n*log(summary_models$rss/n)
plot(AIC ~ I(1:49), xlab = "size of independent variables ", ylab = "AIC",main = "AIC vs. Number of Independent Variables")
best_aic_model = which.min(AIC)
cat("The model which is best according to AIC:",best_aic_model)
The model which is best according to AIC: 45
cat("Best features selected by the AIC:",colnames(summary_df)[summary_df[best_aic_model,]][-1])
cat("\nBest features selected by the BIC:",colnames(summary_df)[summary_df[best_bic_model,]][-1])
cat("\nBest features selected by the ADJR2:",colnames(summary_df)[summary_df[best_adjr2_model,]][-1])
cat("\nBest features selected by the R2:",colnames(summary_df)[summary_df[best_r2_model,]][-1])
Best features selected by the AIC: store2 store3 store4 store5 store7 store8 store9 store10 store11 store12 store13 store14 store15 store16 store17 store18 store19 store20 store21 store22 store25 store26 store27 store29 store30 store31 store32 store33 store34 store35 store36 store37 store38 store39 store40 store41 store42 store43 store44 store45 holiday_flagholiday_week temperature fuel_price cpi unemployment
Best features selected by the BIC: store2 store3 store4 store5 store7 store8 store9 store10 store11 store12 store13 store14 store15 store16 store17 store18 store19 store20 store21 store22 store25 store26 store27 store29 store30 store31 store32 store33 store34 store35 store36 store37 store38 store39 store40 store41 store42 store43 store44 store45 holiday_flagholiday_week temperature fuel_price cpi unemployment
Best features selected by the ADJR2: store2 store3 store4 store5 store6 store7 store8 store9 store10 store11 store12 store13 store14 store15 store16 store17 store18 store19 store20 store21 store22 store23 store25 store26 store27 store29 store30 store31 store32 store33 store34 store35 store36 store37 store38 store39 store40 store41 store42 store43 store44 store45 holiday_flagholiday_week temperature fuel_price cpi unemployment
Best features selected by the R2: store2 store3 store4 store5 store6 store7 store8 store9 store10 store11 store12 store13 store14 store15 store16 store17 store18 store19 store20 store21 store22 store23 store24 store25 store26 store27 store28 store29 store30 store31 store32 store33 store34 store35 store36 store37 store38 store39 store40 store41 store42 store43 store44 store45 holiday_flagholiday_week temperature fuel_price cpi unemployment
# selecting the best of the 49 candidate models based on the MSPE value
y_true = test_walmart$weekly_sales
mspes = c()
for (i in 1:49)
{
  # features flagged TRUE in row i of the selection summary (drop the intercept)
  features = colnames(summary_df)[summary_df[i, ]][-1]
  formula_i = as.formula(paste("weekly_sales ~", paste(features, collapse = "+")))
  model = lm(formula_i, data = train_walmart)
  y_pred = predict(model, test_walmart)
  mspes = c(mspes, mean((y_true - y_pred)^2))
}
cat("MSPE values of all models for each size:", mspes)
MSPE values of all models for each size: 0.02515914 0.02322037 0.02083639 0.01846343 0.01676143 0.01509006 0.01293913 0.01228852 0.01166537 0.01077802 0.01030197 0.009858429 0.009448515 0.009109986 0.008788179 0.008190549 0.007686641 0.00722546 0.006662553 0.006175906 0.005650391 0.005205573 0.005009935 0.004551413 0.004367654 0.004170968 0.004021551 0.003855346 0.003761991 0.003594804 0.003459233 0.003297636 0.003145998 0.003059344 0.003030584 0.003003628 0.002986306 0.002956919 0.002919502 0.002889666 0.002882825 0.002858084 0.002849378 0.002827249 0.002804798 0.002806772 0.002800619 0.00280063 0.00280065
best_mspe = which.min(mspes)
cat("Best model size using MSPE is:",best_mspe , "with MSPE:", mspes[best_mspe], "\n")
cat("Best features selected by the MSPE:",colnames(summary_df)[summary_df[best_mspe,]][-1])
Best model size using MSPE is: 47 with MSPE: 0.002800619 Best features selected by the MSPE: store2 store3 store4 store5 store6 store7 store8 store9 store10 store11 store12 store13 store14 store15 store16 store17 store18 store19 store20 store21 store22 store23 store25 store26 store27 store29 store30 store31 store32 store33 store34 store35 store36 store37 store38 store39 store40 store41 store42 store43 store44 store45 holiday_flagholiday_week temperature fuel_price cpi unemployment
Interpretation of model selection results:
The results above show that the different selection techniques chose different models.
According to the Bayesian Information Criterion (BIC), the optimal model consists of 45 features: all features except store6, store23, store24, and store28.
According to the adjusted R-squared $(adjR^2)$, which penalises the number of predictors to give a fairer account of model complexity, the optimal model includes 47 features, excluding only store24 and store28.
The Akaike Information Criterion (AIC) selects the model that minimises estimated information loss, prioritising predictive reliability. The optimal model by this criterion also includes 45 features, again excluding store6, store23, store24, and store28.
The optimal model determined by the Mean Squared Prediction Error (MSPE) includes 47 features, excluding store24 and store28.
The optimal model determined by $R^2$ includes all 49 features, as expected, since $R^2$ never decreases when predictors are added.
Across the selection techniques (AIC, BIC, adjR2, MSPE), store24 and store28 are never included, and AIC and BIC additionally drop store6 and store23. From these findings, we can conclude that these stores contribute little to predicting weekly sales.
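To illustrate why AIC and BIC can disagree on model size, here is a minimal, self-contained sketch on simulated data (the document's actual candidates come from the 49 rows of `summary_df`; the data and variable names below are invented for the example):

```r
# Simulated data: only x1 and x2 carry signal; x3-x5 are noise.
set.seed(3)
n <- 200
X <- data.frame(matrix(rnorm(n * 5), n, 5))
names(X) <- paste0("x", 1:5)
dat <- cbind(y = 1 + 2 * X$x1 + 0.5 * X$x2 + rnorm(n), X)

# fit nested models of increasing size and score each with both criteria
fits <- lapply(1:5, function(k)
  lm(reformulate(paste0("x", 1:k), "y"), data = dat))
scores <- data.frame(size = 1:5,
                     AIC  = sapply(fits, AIC),
                     BIC  = sapply(fits, BIC))
scores
# BIC charges log(n) per parameter versus AIC's 2, so along a nested
# sequence it can never prefer a larger model than AIC does.
which.min(scores$AIC); which.min(scores$BIC)
```

The same mechanism explains the results above: BIC's heavier penalty is why it agrees with AIC on the 45-feature model while adjR2 and MSPE retain 47 features.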
Objective: To determine whether a generalized linear model could provide a better fit for the Walmart dataset considering the distribution of weekly sales.
The distribution plot of the weekly sales response variable indicates that it deviates from a normal distribution. The distribution is right-skewed, a pattern frequently observed in sales data, where a few weeks have exceptionally high sales while the majority have moderate sales.
In contrast to linear regression, which assumes a linear relationship between the variables, the generalised linear model provides greater flexibility in representing non-linear relationships and handling response variables that are not normally distributed.
We are going to use the Gamma distribution as our dependent variable is continuous, right-skewed, and has positive values. We will employ a square root link function using Generalised Linear Models (GLM).
# function to calculate the MSPE on the held-out test set
y_true_sales = test_walmart$weekly_sales
cal_mspe = function(model, test_data)
{
  # predict on the response scale (for a GLM the default is the link scale)
  y_pred_sales = predict(model, test_data, type = "response")
  mean((y_true_sales - y_pred_sales)^2)
}
# fitting a GLM (stats package) with a square-root link and a Gamma distribution on the response variable
glm_model = glm(weekly_sales ~ . -date, family = Gamma(link = "sqrt"), data = train_walmart)
summary(glm_model)
Call:
glm(formula = weekly_sales ~ . - date, family = Gamma(link = "sqrt"),
data = train_walmart)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.598505 0.025122 23.824 < 2e-16 ***
store2 0.076339 0.008750 8.724 < 2e-16 ***
store3 -0.382242 0.006271 -60.958 < 2e-16 ***
store4 0.133574 0.025036 5.335 9.99e-08 ***
store5 -0.442054 0.006213 -71.151 < 2e-16 ***
store6 -0.003705 0.008318 -0.445 0.656023
store7 -0.287599 0.008624 -33.350 < 2e-16 ***
store8 -0.176133 0.007327 -24.040 < 2e-16 ***
store9 -0.310168 0.006700 -46.292 < 2e-16 ***
store10 0.107690 0.024058 4.476 7.78e-06 ***
store11 -0.048024 0.007964 -6.030 1.77e-09 ***
store12 -0.105925 0.025300 -4.187 2.88e-05 ***
store13 0.122511 0.024572 4.986 6.39e-07 ***
store14 0.106809 0.011691 9.136 < 2e-16 ***
store15 -0.240278 0.022005 -10.919 < 2e-16 ***
store16 -0.316305 0.009178 -34.462 < 2e-16 ***
store17 -0.149986 0.024149 -6.211 5.73e-10 ***
store18 -0.084563 0.021809 -3.877 0.000107 ***
store19 0.004488 0.022544 0.199 0.842203
store20 0.119872 0.009794 12.240 < 2e-16 ***
store21 -0.219801 0.007001 -31.396 < 2e-16 ***
store22 -0.110240 0.021176 -5.206 2.01e-07 ***
store23 -0.012263 0.024241 -0.506 0.612975
store24 -0.014463 0.022317 -0.648 0.516952
store25 -0.236455 0.007255 -32.593 < 2e-16 ***
store26 -0.112808 0.022194 -5.083 3.86e-07 ***
store27 0.073089 0.021805 3.352 0.000809 ***
store28 -0.025568 0.027427 -0.932 0.351270
store29 -0.271904 0.021110 -12.881 < 2e-16 ***
store30 -0.358278 0.006297 -56.897 < 2e-16 ***
store31 -0.036739 0.008029 -4.576 4.86e-06 ***
store32 -0.088673 0.009444 -9.389 < 2e-16 ***
store33 -0.457625 0.023131 -19.784 < 2e-16 ***
store34 -0.114475 0.022612 -5.063 4.29e-07 ***
store35 -0.136721 0.020720 -6.598 4.61e-11 ***
store36 -0.393690 0.006182 -63.679 < 2e-16 ***
store37 -0.315797 0.006542 -48.271 < 2e-16 ***
store38 -0.317296 0.023203 -13.675 < 2e-16 ***
store39 -0.024878 0.008214 -3.029 0.002470 **
store40 -0.136077 0.023902 -5.693 1.32e-08 ***
store41 -0.066261 0.009947 -6.662 3.02e-11 ***
store42 -0.264167 0.023373 -11.302 < 2e-16 ***
store43 -0.258161 0.007169 -36.012 < 2e-16 ***
store44 -0.420981 0.023738 -17.734 < 2e-16 ***
store45 -0.198688 0.010172 -19.533 < 2e-16 ***
holiday_flagholiday_week 0.014098 0.002103 6.704 2.27e-11 ***
temperature -0.009144 0.003092 -2.957 0.003121 **
fuel_price -0.013801 0.003572 -3.864 0.000113 ***
cpi 0.035904 0.025933 1.384 0.166275
unemployment -0.019811 0.007123 -2.781 0.005434 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 0.04304075)
Null deviance: 3006.32 on 4759 degrees of freedom
Residual deviance: 185.77 on 4710 degrees of freedom
AIC: -18748
Number of Fisher Scoring iterations: 9
cat("MSPE of GLM model is",cal_mspe(glm_model,test_walmart))
MSPE of GLM model is 0.05043822
Interpretation of the results of GLM:
We have used the square root link function which involves applying a transformation to the expected value of the response variable.
The results of the Generalised Linear Model (GLM) indicate that most of the store variables have a significant impact on weekly sales, as seen from their coefficients and p-values.
Additionally, the holiday_flag coefficient is positive and significant for weekly sales, consistent with the Multiple Linear Regression (MLR) model.
The variables temperature and fuel price have negative but still significant coefficients, indicating a dampening effect on weekly sales.
The CPI, with a p-value above the significance level of 0.05, may not have a measurable impact on sales.
Besides the CPI, other features also have p-values greater than 0.05 (store6, store19, store23, store24, and store28) and do not appear to affect sales; we could eliminate them as we did in the MLR.
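A minimal sketch of this pruning step, assuming a Wald p-value cutoff of 0.05. The data here is simulated (a small right-skewed response with two real effects and two noise predictors); in the document the fit would use `train_walmart`:

```r
# Simulate a positive, right-skewed response; cpi and unemployment are noise.
set.seed(42)
n <- 500
df <- data.frame(
  temperature  = rnorm(n),
  fuel_price   = rnorm(n),
  cpi          = rnorm(n),
  unemployment = rnorm(n)
)
mu <- exp(1 + 0.3 * df$temperature - 0.2 * df$fuel_price)
df$weekly_sales <- rgamma(n, shape = 5, rate = 5 / mu)

full <- glm(weekly_sales ~ ., family = Gamma(link = "sqrt"), data = df)

# drop predictors whose Wald p-value exceeds 0.05 (the intercept is kept)
pvals   <- summary(full)$coefficients[-1, 4]
keep    <- names(pvals)[pvals < 0.05]
reduced <- glm(reformulate(keep, "weekly_sales"),
               family = Gamma(link = "sqrt"), data = df)
AIC(full); AIC(reduced)   # compare model quality after pruning
```

On the real data the same idea would drop the non-significant store dummies and CPI before refitting.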
The null deviance and residual deviance provide information about the goodness of fit of the model. A model that fits the data well has a residual deviance well below the null deviance. In our case, there is a substantial decrease from the null deviance (3006.32) to the residual deviance (185.77), indicating a strong fit.
The Akaike Information Criterion (AIC) is a metric that quantifies the model's quality. Typically, lower values are more preferable.
The MLR model has a considerably lower MSPE than the GLM model, suggesting better predictive accuracy in that respect.
However, our MLR model does not satisfy the normality assumption. Despite its higher MSPE of about 0.05, the GLM may therefore still be the preferable choice, since it does not assume normally distributed residuals.
I'm interested in analysing Walmart sales data because it allows us to see broader economic trends and consumer behaviours. This information is not only critical for understanding market dynamics, but it also serves as an indicator of the overall health of the retail sector, making it necessary for accurate economic forecasting and analysis. This type of research helps broaden our understanding of economic patterns.
Walmart's sales figures carry a lot of useful information that goes beyond individual transactions. For example, they show how changes in the economy or weather affect people's spending. These numbers are useful for making business decisions, improving marketing strategies, and planning finances.
Retail analytics research has consistently identified a variety of factors that influence sales, such as holidays influencing shopping behaviours, economic downturns correlated with less consumer spending, and location analytics showing how geographical factors can influence store performance due to differences in demographics and competition levels. Existing research allows us to better understand these factors.
For the analysis, I gathered the Walmart data from Kaggle, a platform hosting datasets for research and analytical competitions. The data was gathered through an observational study without manipulating variables for research purposes.
Problem Statement
The primary objective of this analysis is to determine and measure the factors that substantially impact weekly sales at Walmart stores, including seasonal variations, holidays, and economic circumstances that may affect consumer expenditure. In this project, I aim to determine the impact of each variable on sales fluctuations and assess their potential for predicting future sales patterns. Other questions include the precise impact of specific holidays on sales, the influence of economic indicators such as the CPI and fuel prices on consumer behaviour, and variations in sales between Walmart store locations.
I have obtained the Walmart sales data for the project from Kaggle, a well-known platform for hosting datasets used in research and machine learning competitions. The data was not gathered through web scraping or an experimental design; it is a secondary, observational record of Walmart's business operations.
Dataset Link: https://www.kaggle.com/datasets/yasserh/walmart-dataset
Through the above exploratory analysis of Walmart's sales data, I have found important results that relate to the main problem question about the factors affecting weekly sales. According to our results, seasonal patterns and promotional activities have a large effect on sales changes, shown by the fact that sales rise sharply during certain weeks, usually around major holidays. Also, comparing sales across Walmart stores shows big differences, meaning that store-specific factors such as location, size, and local demographics strongly affect performance. The boxplot analysis shows that sales consistently rise around holidays with fewer extreme values, providing further evidence of a holiday effect and suggesting that sales are more stable during holiday weeks. I also looked at the relationship between unemployment rates and sales and found a possible, although weak, inverse correlation. This suggests that economic factors also play a role, though not as much as store-specific or seasonal factors. The weekly sales distribution is right-skewed in the plot, which means that some weeks have surprisingly high sales; this underlines the importance of forecasting models that can handle such anomalies.
var.test() function: the results show a significant difference in variance between the two periods, with a p-value (0.001027) below the standard alpha level of 0.05. Holiday sales showed greater variability than non-holiday periods, possibly due to increased promotional activity.
aov() function: several follow-up tests were performed. Levene's test revealed significant variance discrepancies among the groups, indicating a violation of the equal-variance assumption. The Shapiro-Wilk test on a sample of the residuals indicated that the data do not follow a normal distribution, which is common in real-world sales data that frequently includes outliers and skewness.
Walmart's weekly sales are influenced by factors such as store location, holidays, weather, fuel prices, the CPI, and unemployment rates. I used a multiple regression model to quantify these variables and predict the change in sales for each unit increase in each predictor. The next statistical analysis method will cover the assumptions for this model and the corresponding diagnostics.
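The tests described above can be sketched in base R as follows. The data here is simulated (a right-skewed "sales" vector split into non-holiday and holiday weeks, with the holiday group made more variable); the document's actual data frame and group sizes differ. Since Levene's test lives in the `car` package, the base-R Brown-Forsythe variant (ANOVA on absolute deviations from group medians) stands in for it:

```r
# Simulated right-skewed sales: 300 non-holiday weeks, 60 holiday weeks,
# with the holiday group given a larger variance.
set.seed(1)
sales   <- c(rgamma(300, 2, 2), rgamma(60, 2, 1))
holiday <- factor(rep(c("no", "yes"), c(300, 60)))

# F-test for equal variances between the two periods
var.test(sales ~ holiday)

# one-way ANOVA plus a residual normality check
fit <- aov(sales ~ holiday)
shapiro.test(residuals(fit))

# Brown-Forsythe stand-in for Levene's test:
# ANOVA on absolute deviations from each group's median
dev <- abs(sales - ave(sales, holiday, FUN = median))
summary(aov(dev ~ holiday))
```

With skewed, unequal-variance data like this, both the F-test and the Shapiro-Wilk test reject, mirroring the violations reported above.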
The model showed that sales usually increase during holiday weeks, likely due to promotions and offers around holidays such as Christmas. The negative coefficients for unemployment, fuel price, and temperature indicate that a one-unit increase in each of these features decreases predicted sales, consistent with lower consumer purchasing power. The CPI positively impacts sales in areas with higher consumer prices. The model's small mean squared prediction error (MSPE) indicates good performance on unseen data, and the model captured 92.57% of the variance in weekly sales.
I used the diagnostic plots to validate the model's assumptions to ensure the above regression model effectively predicts Walmart's weekly sales. These plots are used to check the presence of linearity, constant variance, normality, and independence of residuals, which are important for ensuring the model's integrity.
The diagnostic plots show violations of the linear regression assumptions, including non-linearity and heteroscedasticity. Non-linearity and non-constant variance can be handled by transforming the response variable or by using non-linear models, and departures from normality can be addressed with transformations or robust regression techniques. There is, however, no autocorrelation between the residuals.
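As a minimal sketch of the transformation remedy mentioned above, the example below simulates data with multiplicative (lognormal) noise, so an untransformed linear fit shows skewed residuals while a log-transformed response restores approximate normality. The data and variable names are invented for illustration; on the real data the response would be `weekly_sales` in `train_walmart`:

```r
# Simulate a response with multiplicative noise: y = exp(linear + error).
set.seed(7)
x <- runif(400)
y <- exp(0.5 + 2 * x + rnorm(400, sd = 0.4))

raw    <- lm(y ~ x)        # violates normality: right-skewed residuals
logged <- lm(log(y) ~ x)   # log transform makes the noise additive again

# Shapiro-Wilk p-values: near zero for the raw fit, much larger after the log
c(raw = shapiro.test(residuals(raw))$p.value,
  log = shapiro.test(residuals(logged))$p.value)
```

A Box-Cox search (`MASS::boxcox`) is the more systematic way to pick such a transformation; the log is simply its most common special case.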
A thorough analysis of Walmart store sales data makes it clear that holidays, store locations, and economic indicators like fuel prices and the consumer price index (CPI) are some of the most significant factors that affect weekly sales. Holidays, especially Thanksgiving and Christmas, have a big effect on sales, which suggests that seasonal promotions and purchasing behaviours have a big effect. Location of the store also impacts sales, showing that demographic and regional factors are very important for profitable sales.
I learned how important it is to make decisions in retail operations based on data from this project. Knowing how to predict and understand sales patterns based on things like holidays and economic factors can help with planning strategies and allocating resources. This analysis also shows how important it is to have different strategies for each store location, taking into account the store's specific features and the different kinds of customers that shop there. And also from this project it is clear how important it is to use statistical methods to get useful information from large datasets, which will ultimately help businesses make better decisions.
Further research could help us understand Walmart's sales by including more factors such as in-store promotions, the effect of online sales, and behavioural data about customers. Looking into how online and offline sales platforms work together would help us understand how multichannel selling affects sales in physical stores. Comparing Walmart to competitors such as Costco could also help identify unique market advantages or areas for improvement.
Looking into more complex predictive modelling methods, such as machine learning algorithms, could help improve sales predictions and give more detailed insight into what drives sales. Some methods, like neural networks and random forests, may reveal complex patterns and interactions that are hidden from more common statistical methods. Lastly, incorporating real-time data analysis could help us better understand how trends change over time and plan long-term sales strategies.
References used for this project